Abstract

We survey the different properties of an intuitive notion of redundancy,
as a function of the precise semantics given to the notion of
partial implication.

The final version of this survey will appear in the Proceedings of 
the Int. Conf. Formal Concept Analysis, 2015. 


1 Introduction 

The discovery of regularities in large scale data is a multifaceted current 
challenge. Each syntactic mechanism proposed to represent such regularities 
opens the door to wide research questions. We focus on a specific sort of 
regularities sometimes found in transactional data, that is, data where each 
observation is a set of items, and defined in terms of pairs of sets of items. 

Syntactically, the fact that this sort of regularity holds for a given pair
(X, Y) of sets of items is often denoted as an implication: X -> Y. However,
whereas in Logic an implication like this is true if and only if Y holds
whenever X does, in our context, namely, partial implications and association
rules, it is enough if Y holds “most of the times” X does. Thus, in
association mining, the aim is to find out which expressions of that sort are valid
for a given transactional dataset: for what X and what Y, the transactions
that contain X “tend to contain” Y as well.

In many current works, that syntax is defined as if its meaning were
sufficiently clear. Then, any of a number of “measures of interestingness” is
chosen to apply to them, in order to select some to be output by a data
analysis process on a particular dataset. Actually, the mere notation X -> Y
is utterly insufficient: any useful perspective requires endowing these
expressions with a definite semantics that makes precise how that naive intuition of
“most of the times” is formalized; only then can we study and clarify the
algorithmic properties of these syntactical expressions. Thus, we are not really
to “choose a measure of interestingness” but plainly to define what X -> Y
means, and there are many acceptable ways of doing this.

This idea of a relaxed implication connective is a relatively natural
concept, and versions sensibly defined by resorting to conditional probability
have been proposed in different research communities: a common semantics
of X -> Y is through a lower bound on its “confidence”, the conditional
probability of Y given X. This meaning appears already in the “partial
implications” of [27] (actually, “implications partielles”, with confidence christened
there “precision”). Some contributions based on Mathematical Logic develop
notions related to these partial implications defined in terms of conditional
probability: see [18]. However, it must be acknowledged that the
contribution that put the spotlight on partial implications was [3] and the
improved algorithm in [3]: the proposal of exploring large datasets in search
of association rules of high support and confidence has led to huge amounts
of research since. Association rules are partial implications that impose the
additional condition that the consequent is a single item.

Three of the major foci of research in association rules and partial
implications are as follows. First, the quantity of candidate itemsets for both
the antecedent X and, sometimes, the consequent Y grows exponentially
with the number of items. Hence, the space to explore is potentially
enormous: on real-world data, we very soon run into billions of candidate
antecedents. Most existing solutions are based on the acceptance that, as
not all of them can be considered within reasonable running times, we make
do with those that obey the support constraint (“frequent itemsets”). The
support constraint combines well with confidence in order to avoid reporting
mere statistical artifacts [2B], but its major role is to reduce the search space.
A wide repertory of algorithms for frequent sets and association rule mining
exists by now [T|.

Second, many variations have been explored: for instance, cases of more
complicated structures in the data and, also, combinations with other
machine-learning models or tasks like in [5DJ [H].

This paper surveys part of a research line that belongs to a third focus: in
a vast majority of practical applications, if any partial implication is found at
all, it often happens that the search returns hundreds of thousands of them.
It is far from trivial to design an associator able to choose well, among them,
a handful to show to an impatient user. This is tantamount to modifying
the semantics of the partial implication connective, by adding or changing
the conditions under which one such expression is deemed valid and is to be
reported. Most often, but not always (as we report in Sections 3 and 4), this
approach takes the form of “quality evaluations” performed to select which
partial implications are to be highlighted for the user. We do not consider
this problem solved yet, but substantial progress has been achieved so far; we
survey a humble handful of those advances, where the present author was actively
involved. For a wider perspective of all these three aspects of association
rule mining, see Part II of [43].

The main thread of this paper can be described informally as follows:
human intuition, maybe on the basis of our experience with full, standard 
implications, tends to expect that smaller antecedents are better than larger 
ones, and larger consequents are better than smaller ones. We call this 
statement here the central intuition of this paper; many references express, 
in various variants, this intuition (e.g. m m ESI m E21 ES] just to name 
a few). This intuition is only partially true in implications, where the GD 
basis gets to be minimal through the use of subtly enlarged antecedents m- 
This survey paper discusses, essentially, the particular fact that, on partial 
implications, this intuition is both true and false... as a function, of course, 
of the actual semantics given to the partial implication connective. 


2 Notation and Preliminary Definitions 

Our datasets are transactional. This means that they are composed of
transactions, each of which consists of an itemset with a unique transaction
identifier. Itemsets are simply subsets of some fixed set U of items. We will denote
itemsets by capital letters from the end of the alphabet, and use juxtaposition
to denote union, as in XY. The inclusion sign as in X ⊂ Y denotes proper
subset, whereas improper inclusion is denoted X ⊆ Y. The cardinality of a
set X (either an itemset or a set of transactions) is denoted |X|.

2.1 Partial Implications 

As indicated in the Introduction, the most common semantics of partial
implication is its confidence: the conditional empirical probability of the
consequent given the antecedent, that is, the ratio between the number of
transactions in which X and Y are seen together and the number of transactions
that contain X. We will see below that this semantics may be somewhat
misleading. In most application cases, the search space is additionally
restricted by a minimal support criterion, thus avoiding itemsets that appear
very seldom in the dataset.

More precisely, for a given dataset D, consisting of n transactions, the
supporting set D_X ⊆ D of an itemset X is the subset of transactions that
include X. (For the reader familiar with the FP-growth frequent set miner
[19], these are the same as their “projected databases”, except for the minor
detail that, here, we do not remove X from the transactions.)

The support s_D(X) = |D_X|/n ∈ [0,1] of an itemset X is the cardinality
of the set of transactions that contain X divided by n; it corresponds to the
relative frequency or empirical probability of X. An alternative rendering
of support is its unnormalized version, but some of the notions that will
play a major role later on are simpler to handle with normalized supports.
Now, the confidence of a partial implication X -> Y is c_D(X -> Y) =
s_D(XY)/s_D(X): that is, the empirical approximation to the corresponding
conditional probability. The support of a partial implication X -> Y is
s_D(X -> Y) = s_D(XY). In both expressions, we will omit the subscript D
whenever the dataset is clear from the context. Clearly, for the supporting
set D_Z of an itemset Z, s_{D_Z}(X) = |D_{XZ}|/|D_Z| = c(Z -> X).
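
The definitions above translate almost literally into code. The following is a minimal sketch, not taken from any of the cited systems: the toy dataset and the function names `supporting_set`, `support` and `confidence` are our own illustrative choices, and exact fractions are used so that no rounding obscures the identities.

```python
from fractions import Fraction

# Hypothetical toy dataset: each transaction is a set of single-letter items.
DATASET = [
    frozenset("ABC"),
    frozenset("AB"),
    frozenset("AC"),
    frozenset("BC"),
    frozenset("A"),
]


def supporting_set(dataset, itemset):
    """Transactions that include every item of `itemset` (the set D_X)."""
    items = frozenset(itemset)
    return [t for t in dataset if items <= t]


def support(dataset, itemset):
    """Normalized support s_D(X) = |D_X| / n, as an exact fraction."""
    return Fraction(len(supporting_set(dataset, itemset)), len(dataset))


def confidence(dataset, antecedent, consequent):
    """Confidence c_D(X -> Y) = s_D(XY) / s_D(X); fails if s_D(X) = 0."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    return support(dataset, xy) / support(dataset, x)


print(support(DATASET, "A"))          # 4/5
print(confidence(DATASET, "A", "B"))  # 1/2
# The support of B inside the supporting set of A equals c(A -> B).
print(support(supporting_set(DATASET, "A"), "B") == confidence(DATASET, "A", "B"))
```

Note that `confidence` raises `ZeroDivisionError` when the antecedent never occurs; a production associator would rule such antecedents out through the support threshold.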

Often, we will assume that X ∩ Y = ∅ in partial implications X -> Y.
Some works impose this condition globally; we will mention it explicitly
whenever it is relevant, but, generally speaking, we allow X and Y to intersect
or, even, to fulfill X ⊆ Y. Note that, if only support and confidence are at
play, then c_D(X -> XY) = c_D(X -> Y) and s_D(X -> XY) = s_D(X -> Y).
Of course, in practical terms, after a partial implication mining process, only
the part of Y that does not appear in X would be shown to the user.

We do allow X = ∅ as antecedent of a partial implication: then, its
confidence coincides with the support, c_D(∅ -> Y) = s_D(Y), since s_D(∅) = 1.
Allowing Y = ∅ as consequent as well is possible but turns out not to be very
useful; therefore, empty-consequent partial implications are always omitted
from consideration. All along the paper, there are occasional glitches where
the empty set requires separate consideration. Being interested in the
general picture, here we will mostly ignore these issues, but the reader can
check that these cases are given careful treatment in the original references
provided for each part of our discussion.

By X => Y we denote full, standard logical implication; this expression
will be called the full counterpart of the partial implication X -> Y.

2.2 Partial Implications versus Association Rules 

Association rules were defined originally as partial implications X -> Y with
singleton consequents: |Y| = 1; we abbreviate X -> {A} as X -> A. This
decision allows one to reduce association mining to a simple postprocessing
after finding frequent sets. Due to the illusion of augmentation, many users
are satisfied with this syntax; however, more items in the consequent
provide more information.

Indeed, in full implications, the expression (A => B) ∧ (A => C) is fully
equivalent to A => BC, and we lose little by enforcing singleton consequents
(equivalently, definite Horn clauses); an exception is the discussion of minimal
bases, where nonsingleton consequents allow for canonical bases that are
unreachable in the Horn clause syntax da- But, in partial implications,
A -> BC says more than the conjunction of A -> B and A -> C, namely, B
and C abound jointly in D_A. Whenever possible, A -> BC is better, being
both more economical and more informative. This can be illustrated by the
following example from [8], to which we will return later on.

Example 1 Consider a dataset on U = {A, B, C, D, E} consisting of 12
transactions: 6 of them include all of U, 2 consist of ABC, 2 more are AB,
and then one each of CDE and BC. It can be seen that the confidence of
B -> A is 10/11 and that of B -> C is 9/11, whereas the confidence of
B -> AC is 8/11.
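
The dataset of Example 1 is small enough to be checked mechanically. The sketch below encodes it as a list of transactions and recomputes some of the confidences used in this paper; the names `EXAMPLE_1`, `count` and `confidence` are illustrative assumptions of ours, not notation from the references.

```python
from fractions import Fraction

# Example 1: 12 transactions over U = {A, B, C, D, E}.
EXAMPLE_1 = (
    [frozenset("ABCDE")] * 6
    + [frozenset("ABC")] * 2
    + [frozenset("AB")] * 2
    + [frozenset("CDE"), frozenset("BC")]
)


def count(dataset, itemset):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in dataset if items <= t)


def confidence(dataset, antecedent, consequent):
    """c(X -> Y) as an exact fraction of unnormalized supports."""
    x = frozenset(antecedent)
    return Fraction(count(dataset, x | frozenset(consequent)), count(dataset, x))


print(confidence(EXAMPLE_1, "B", "C"))   # 9/11
print(confidence(EXAMPLE_1, "B", "AC"))  # 8/11
print(confidence(EXAMPLE_1, "A", "BC"))  # 4/5
```

The joint consequent AC holds less often within the supporting set of B than C alone, which is the extra information that a nonsingleton consequent conveys.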

Actually, even restricted to association rules, the output of
confidence-based associators is often still too large: the rest of this paper discusses
how to reduce the output with no loss of information, first, and, then, as
the outcome is often still too large in practice, we will need to allow for a
carefully tuned loss of information.

3 Redundancy in Confidence-Based Partial 
Implications 

We start our discussion by “proving correct” our central intuition, that is,
providing a natural semantics under which that intuition is correct. For this
section, we work under confidence and support thresholds, and it turns out
to be convenient to explicitly assume that the left-hand side of each partial
implication is included in the right-hand side. We force that inclusion using
notations in the style of X -> XY.

Several references (0 for one) have considered the following argument:
assume that we could know beforehand that, in all datasets, the confidence
and support of X_0 -> X_0Y_0 are always larger than or equal to those of
X_1 -> X_1Y_1. Then, whenever we are mining some dataset under confidence
and support thresholds, assume that we find X_1 -> X_1Y_1: we should not
bother to report as well X_0 -> X_0Y_0, since it must be there anyhow, and its
presence in the output is uninformative. In a very strong sense, X_0 -> X_0Y_0
is redundant with respect to X_1 -> X_1Y_1. Irredundant partial implications
according to this criterion are called “essential rules” in [2] and representative
rules in [21]; we will follow this last term.

Lemma 1 Consider two partial implications, X_0 -> X_0Y_0 and X_1 -> X_1Y_1.
The following are equivalent:

1. The confidence and support of X_0 -> X_0Y_0 are larger than or equal to
those of X_1 -> X_1Y_1, in all datasets: for every D,
c_D(X_0 -> X_0Y_0) ≥ c_D(X_1 -> X_1Y_1) and
s_D(X_0 -> X_0Y_0) ≥ s_D(X_1 -> X_1Y_1).

2. The confidence of X_0 -> X_0Y_0 is larger than or equal to that of
X_1 -> X_1Y_1, in all datasets: for every D,
c_D(X_0 -> X_0Y_0) ≥ c_D(X_1 -> X_1Y_1).

3. X_1 ⊆ X_0 ⊆ X_0Y_0 ⊆ X_1Y_1.

When these cases hold, we say that X_1 -> X_1Y_1 makes X_0 -> X_0Y_0
redundant. The fact that the inequality on support follows from the inequality on
confidence is particularly striking. This lemma can be interpreted as proving
correct the central intuition that smaller antecedents and larger consequents
are better, by identifying a semantics of the partial implication connective
that makes this true and by pointing out that it is not just the consequent
that is to be maximized, but the union of antecedent and consequent. If
only consequents are maximized separately, and are kept disjoint from the
antecedents, then one gets to a considerably more complicated situation discussed
below.

Definition 1 Fix a dataset and confidence and support thresholds. The
representative rule basis for that dataset at these support and confidence
thresholds consists of those partial implications that pass both thresholds in the
dataset, and are not made redundant, in the sense of the previous paragraph,
by other partial implications also above the thresholds.

Hence, a redundant partial implication is so because we can know
beforehand, from the information in the basis, that its confidence is above the
threshold. We have:

Proposition 1 (Essentially, from [21].) For a fixed dataset D and a fixed
confidence threshold γ:

1. Every partial implication of confidence at least γ is made redundant by
some representative rule.

2. Partial implication X -> Y with X ⊆ Y is a representative rule if and
only if c_D(X -> Y) ≥ γ but there is no X' and Y' with X' ⊆ X and
XY ⊆ X'Y' such that c_D(X' -> Y') ≥ γ, except X = X' and Y = Y'.

According to statement (3) in Lemma 1, that last point means that a
representative rule is not redundant with respect to any partial implication
(different from itself) that has confidence at least γ in the dataset. It is
interesting to note that one does not need to mention support in this last
proposition, the reason being, of course, statement (2) in Lemma 1. The fact
that statement (3) implies statement (1) was already pointed out in jUHTT. T1
(in somewhat different terms). The remaining implications are from [7]; see
this reference as well for proofs of additional properties, including the fact
that the representative basis has the minimum possible size among all bases for
this notion of redundancy, and for discussions of other related redundancy
notions. In particular, several other natural proposals are shown there to
be equivalent to this redundancy. Also [8] provides further properties of the
representative rules. These references discuss as well the connection with a
similar notion in [02].

In Example 1, at confidence threshold 0.8, the representative rule basis
consists of seven partial implications: ∅ -> C, B -> C, ∅ -> AB, C -> AB,
A -> BC, D -> ABCE, and E -> ABCD.
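
For a dataset as small as that of Example 1, the representative rule basis can be recomputed by brute force directly from Definition 1 and statement (3) of Lemma 1. The sketch below is only an illustration under our own assumptions: it enumerates all pairs X ⊂ Z (read as X -> Z), uses no support threshold, and is exponential in the number of items, so it is not how an actual associator would proceed. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

ITEMS = "ABCDE"
EXAMPLE_1 = (
    [frozenset("ABCDE")] * 6
    + [frozenset("ABC")] * 2
    + [frozenset("AB")] * 2
    + [frozenset("CDE"), frozenset("BC")]
)


def count(dataset, itemset):
    """Number of transactions containing `itemset` (a frozenset)."""
    return sum(1 for t in dataset if itemset <= t)


def subsets(items):
    """All subsets of `items`, as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def representative_rules(dataset, items, gamma):
    """Pairs (X, Z), X a proper subset of Z, standing for the rule X -> Z."""
    valid = []
    for z in subsets(items):
        for x in subsets(z):
            if x == z or count(dataset, x) == 0:
                continue  # skip empty consequents and undefined confidences
            if Fraction(count(dataset, z), count(dataset, x)) >= gamma:
                valid.append((x, z))
    # By Lemma 1, (x1, z1) makes (x0, z0) redundant when x1 <= x0 and z0 <= z1.
    return [
        (x0, z0) for (x0, z0) in valid
        if not any((x1, z1) != (x0, z0) and x1 <= x0 and z0 <= z1
                   for (x1, z1) in valid)
    ]


basis = representative_rules(EXAMPLE_1, ITEMS, Fraction(4, 5))
for x, z in basis:
    # Show only the part of the consequent that is not in the antecedent.
    print("".join(sorted(x)) or "{}", "->", "".join(sorted(z - x)))
```

The threshold is passed as the exact fraction 4/5 because some rules of the basis, such as C -> AB, have confidence exactly 0.8 and could be lost to floating-point rounding.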

3.1 Quantitative Evaluation of Non-Redundancy: Confidence Width

Redundancy is a qualitative property; still, it allows for a quantitative
discussion. Consider a representative rule X -> XY: at confidence c(X -> XY), no
partial implication makes it redundant. But we could consider now to what
extent we need to reduce the confidence threshold in order to find a partial
implication that would make this one redundant. If a partial implication of
almost the same confidence can be found to make X -> XY redundant, then
our partial implication is not so interesting. According to this idea, one can
define a parameter, the confidence width [5], that, in a sense, evaluates how
different our partial implication is from other similar ones. We do not discuss
this parameter further, but a related quantity is treated below in Section 6.2.

3.2 Closure-Aware Redundancy Notions 

Redundancy of one partial implication with respect to another can be
redefined as well in a similar but slightly more sophisticated form by taking into
account the closure operator obtained from the data (see [15]). Often, this
variant yields a more economical basis because the full implications are
described by their often very short Guigues-Duquenne basis HU; see again 0
for the details.


4 Redundancy with Multiple Premises 

The previous section indicates precisely when “one partial implication follows 
logically from another”. It is natural to ask whether a stronger, more useful 
notion to reduce the size of a set of partial implications could be based on
partial implications following logically from several others together, beyond
the single-premise case. 

Simply considering standard examples with full implications like
Augmentation (from X => Y and X' => Y' it follows XX' => YY') or Transitivity
(from X => Y and Y => Z it follows X => Z), it is easy to see that these
cases fail badly for partial implications. Indeed, one might suspect, as this
author did for quite some time, that one partial implication would not follow
logically from several premises unless it follows from one of them.

Generally speaking, however, this suspicion is wrong. It is indeed true for
confidence thresholds γ ∈ (0, 0.5), but these are not very useful in practice,
as an association rule X -> A of confidence less than 0.5 means that, in D_X,
the absence of A is more frequent than its presence.

And, for γ ∈ [0.5, 1), it turns out that, for instance, from A -> BC and
A -> BD it follows ACD -> B, in the sense that if both premises have
confidence at least γ in any dataset, then the conclusion also does. The general
case for two premises was fully characterized in [7], but the case of arbitrary
premise sets has remained elusive for some years. Eventually, a very recent
result from [5] proved that redundancy with respect to a set of premises
that are partial implications hinges on a complicated combinatorial property
of the premises themselves. We give that property a short (if admittedly
uninformative) name here:

Definition 2 Let X_1 -> Y_1, ..., X_k -> Y_k be a set of partial implications. We
say that it is nice if X_1 => Y_1, ..., X_k => Y_k |= X_i => U, for all i ∈ 1...k,
where U = X_1Y_1 ··· X_kY_k.

Here we use the standard symbol |= for logical entailment; that is,
whenever the implications at the left-hand side are true, the one at the right-hand
side must be as well.

Note that the definition of nicety of a set of partial implications states a
property, not of the partial implications themselves, but of their full
counterparts. Then, we can characterize entailment among partial implications
for high enough thresholds of confidence, as follows:

Theorem 1 Let X_1 -> Y_1, ..., X_k -> Y_k be a set of partial implications
with k > 1, candidates to premises, and a candidate conclusion X_0 -> Y_0. If
γ ≥ (k - 1)/k, then the following are equivalent:

1. in any dataset where the confidence of the premises X_1 -> Y_1, ...,
X_k -> Y_k is at least γ, c(X_0 -> Y_0) ≥ γ as well;

2. either Y_0 ⊆ X_0, or there is a non-empty L ⊆ {1...k} such that the
following conditions hold:

(a) {X_i -> Y_i : i ∈ L} is nice,

(b) ∪_{i∈L} X_i ⊆ X_0 ⊆ ∪_{i∈L} X_iY_i,

(c) Y_0 ⊆ X_0 ∪ ∩_{i∈L} Y_i.
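
Condition 2 of Theorem 1 is purely syntactic and can be tested by enumerating the candidate index sets L. The following sketch is our own reading of Definition 2 and of conditions (a) to (c), with hypothetical function names; niceness is decided by forward chaining on the full counterparts of the premises. It says nothing about thresholds below (k - 1)/k, where the theorem does not apply.

```python
from itertools import combinations


def closure(itemset, implications):
    """Closure of `itemset` under full implications given as (X, Y) pairs."""
    closed = set(itemset)
    changed = True
    while changed:
        changed = False
        for x, y in implications:
            if set(x) <= closed and not set(y) <= closed:
                closed |= set(y)
                changed = True
    return closed


def is_nice(rules):
    """Definition 2: each X_i => U follows from the full counterparts."""
    universe = set().union(*(set(x) | set(y) for x, y in rules))
    return all(universe <= closure(x, rules) for x, _ in rules)


def entailed(premises, conclusion):
    """Condition 2 of Theorem 1, trying every non-empty subset L of premises."""
    x0, y0 = set(conclusion[0]), set(conclusion[1])
    if y0 <= x0:
        return True
    for size in range(1, len(premises) + 1):
        for chosen in combinations(premises, size):
            union_x = set().union(*(set(x) for x, _ in chosen))
            union_xy = set().union(*(set(x) | set(y) for x, y in chosen))
            common_y = set.intersection(*(set(y) for _, y in chosen))
            if (is_nice(chosen) and union_x <= x0 <= union_xy
                    and y0 <= x0 | common_y):
                return True
    return False


premises = [("A", "BC"), ("A", "BD")]
print(entailed(premises, ("ACD", "B")))  # True: needs both premises
print(entailed(premises, ("A", "B")))    # True: follows from one premise
print(entailed(premises, ("A", "CD")))   # False
```

Itemsets are written as strings of single-letter items purely for brevity. The enumeration over subsets of premises is exponential in k, which is acceptable only for a handful of premises.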

Interestingly, the last couple of conditions are reasonably correlated, for
the case of several premises, with the central intuition that smaller
antecedents are better than larger ones, and larger consequents are better than
smaller ones. The premises actually necessary must all include the
consequent of the conclusion, and their antecedents are to be included in the
antecedent of the conclusion. Even the additional fact that the antecedent of
the conclusion does not have “extra items” not present in the premises also
makes sense.

However, there is the additional condition that only nice sets of partial
implications may have a nontrivial logical consequence, and all this just for
high enough confidence thresholds. The proof is complex and we refrain
from discussing it here; see [5], where, additionally, the case of γ < 1/k is
also characterized and the pretty complicated picture for intermediate values
of γ is discussed.
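
The role of the threshold can be observed on the entailment example given at the start of this section (from A -> BC and A -> BD it follows ACD -> B). The sketch below is a sanity check and by no means a proof: it exhaustively searches small datasets, with at most two copies of each possible transaction, for one where both premises reach γ but the conclusion does not. The function names and the bound on copies are our own assumptions.

```python
from fractions import Fraction
from itertools import product

# The 8 possible transaction types that contain A, described by which of
# B, C, D they also contain. Transactions without A affect neither the
# premises (antecedent A) nor the conclusion (antecedent ACD).
TYPES = [frozenset(s) for s in ("", "B", "C", "D", "BC", "BD", "CD", "BCD")]


def count(multiplicities, itemset):
    """Number of A-transactions containing all of `itemset` (a subset of BCD)."""
    items = frozenset(itemset)
    return sum(m for t, m in zip(TYPES, multiplicities) if items <= t)


def find_counterexample(gamma, max_copies=2):
    """Dataset where A -> BC and A -> BD reach gamma but ACD -> B does not."""
    for mult in product(range(max_copies + 1), repeat=len(TYPES)):
        n_a, n_acd = count(mult, ""), count(mult, "CD")
        if n_a == 0 or n_acd == 0:
            continue  # some confidence is undefined
        premises_hold = (Fraction(count(mult, "BC"), n_a) >= gamma
                         and Fraction(count(mult, "BD"), n_a) >= gamma)
        if premises_hold and Fraction(count(mult, "BCD"), n_acd) < gamma:
            return dict(zip(("".join(sorted(t)) for t in TYPES), mult))
    return None


print(find_counterexample(Fraction(1, 2)))              # None
print(find_counterexample(Fraction(2, 5)) is not None)  # True
```

At γ = 2/5, for instance, two transactions ABC, two ABD and one ACD give both premises confidence exactly 2/5 while ACD -> B has confidence 0, so the entailment indeed fails below 1/2.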

We do indicate, though, that the notion of “nicety”, in practice, turns out
to be so restrictive that we have not found any case of nontrivial entailment
from more than one premise in a number of tests with standard benchmark
datasets. Therefore, this approach is not particularly useful in practice to
reduce the size of the outcome of an associator.

4.1 Ongoing Developments 

As for representative rules (Subsection 3.2), there exists a natural variant of
the question of redundancy, whereby full implications are handled separately;
essentially, the redundancy notion becomes “closure-based”. This extension
was fully characterized as well for the case of two premises in [7], but it is
current work in progress how to extend the scheme to the case of arbitrary
quantities of premises.


5 Alternative Evaluation Measures


We move on to discuss how to reinterpret the central intuition as we change
the semantics of the partial implication connective. Confidence is widely used
as a definition of partial implication but, in practice, presents two drawbacks.
First, it does not detect negative correlations; and, second, as already
indicated, it often lets far too many rules pass and, moreover, fiddling with the
confidence threshold turns out to be a mediocre or just useless solution.
Examples of both disadvantages are both easy to construct and easy to find on
popular benchmark datasets. Both objections can be addressed by changing
the semantics of the expression X -> Y, by either replacing the confidence
measure or by strengthening it with extra conditions. The literature on this
topic is huge and cannot be reviewed here: see HUES] US and their
references for information about the relevant developments published on these
issues. We focus here on just a tiny subset of all these studies.

The first objection alluded to in the previous paragraph can be naturally
solved via an extra normalization (more precisely, dividing the confidence by
the support of the consequent). The outcome is lift, a well-known expression
in basic probability; a closely related parameter is leverage:

Definition 3 Assume X ∩ Y = ∅. The lift of partial implication X -> Y is
ℓ_D(X -> Y) = c_D(X -> Y)/s_D(Y) = s_D(XY)/(s_D(X) × s_D(Y)). The leverage
of partial implication X -> Y is λ_D(X -> Y) = s_D(XY) - s_D(X) × s_D(Y).

If supports are unnormalized, extra factors n are necessary. In case of
independence of both sides of a partial implication X -> Y, we would have
s(XY) = s(X)s(Y); therefore, both lift and leverage are measuring
deviation from independence: lift is the multiplicative deviation, whereas leverage
measures it as an additive distance instead. Leverage was introduced
in [32] and, under the name “Novelty”, in [24], and received much attention
via the Magnum Opus associator [38]. We find lift in the references going
by several different names: it has been called interest [34] or, in a slightly
different but fully equivalent form, strength [33]; lift seems to be catching on
as a short name, possibly aided by the fact that the Intelligent Miner system
from IBM employed that name. These notions allow us to exemplify that we
are modifying the semantics of our expressions: if we define the meaning of
X -> Y through confidence, then partial implications of the form X -> Y
and X -> XY are always equivalent, whereas, if we use lift, then they may
not be. Note that, in case X = ∅, the lift trivializes to 1. Also, if we are to
use lift, then we must be careful to keep the right-hand side Y disjoint from
the left-hand side: X ∩ Y = ∅.
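
As an illustration, lift and leverage can be evaluated on the dataset of Example 1. This is a minimal sketch with function names of our own choosing; it assumes, as Definition 3 does, that antecedent and consequent are disjoint, and it does not check that assumption.

```python
from fractions import Fraction

# Dataset of Example 1.
EXAMPLE_1 = (
    [frozenset("ABCDE")] * 6
    + [frozenset("ABC")] * 2
    + [frozenset("AB")] * 2
    + [frozenset("CDE"), frozenset("BC")]
)


def support(dataset, itemset):
    """Normalized support of `itemset`, as an exact fraction."""
    items = frozenset(itemset)
    return Fraction(sum(1 for t in dataset if items <= t), len(dataset))


def lift(dataset, x, y):
    """s(XY) / (s(X) * s(Y)); X and Y are assumed disjoint."""
    xy = frozenset(x) | frozenset(y)
    return support(dataset, xy) / (support(dataset, x) * support(dataset, y))


def leverage(dataset, x, y):
    """s(XY) - s(X) * s(Y); X and Y are assumed disjoint."""
    xy = frozenset(x) | frozenset(y)
    return support(dataset, xy) - support(dataset, x) * support(dataset, y)


print(lift(EXAMPLE_1, "A", "BC"), leverage(EXAMPLE_1, "A", "BC"))  # 16/15 1/24
print(lift(EXAMPLE_1, "D", "B"), leverage(EXAMPLE_1, "D", "B"))    # 72/77 -5/144
print(lift(EXAMPLE_1, "", "C"))                                    # 1
```

The rule D -> B has confidence 6/7, high by most standards, yet its lift is below 1 and its leverage is negative: B is slightly less frequent among the transactions containing D than in the whole dataset.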

A related notion is: 

Definition 4 [24] The relative confidence of partial implication X -> Y,
also called centered confidence or relative accuracy, is
r_D(X -> Y) = c_D(X -> Y) - c_D(∅ -> Y).

Therefore, the relative confidence is measuring additively the effect, on
the support of the consequent Y, of “adding the condition” or antecedent X.
Since c_D(∅ -> Y) = s_D(Y), lift can be seen as comparing c_D(X -> Y) with
c_D(∅ -> Y), that is, effecting the same comparison but multiplicatively this
time: ℓ(X -> Y) = c(X -> Y)/s(Y) = c(X -> Y)/c(∅ -> Y). Also, it is easy to check
that leverage can be rewritten as λ_D(X -> Y) = s_D(X) × r_D(X -> Y) and is
therefore called also weighted relative accuracy [24]. Relative confidence has
the potential to solve the “negative correlation” objection to confidence, and
all subsequent measures to be described here inherit this property as well.
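
The identity between leverage and weighted relative confidence can be checked numerically. The sketch below, again on the dataset of Example 1 and with illustrative function names of our own, computes relative confidence for a few rules and asserts that leverage equals the support of the antecedent times the relative confidence.

```python
from fractions import Fraction

# Dataset of Example 1.
EXAMPLE_1 = (
    [frozenset("ABCDE")] * 6
    + [frozenset("ABC")] * 2
    + [frozenset("AB")] * 2
    + [frozenset("CDE"), frozenset("BC")]
)


def support(dataset, itemset):
    """Normalized support of `itemset`, as an exact fraction."""
    items = frozenset(itemset)
    return Fraction(sum(1 for t in dataset if items <= t), len(dataset))


def confidence(dataset, x, y):
    """c(X -> Y) = s(XY) / s(X)."""
    return support(dataset, frozenset(x) | frozenset(y)) / support(dataset, x)


def relative_confidence(dataset, x, y):
    """r(X -> Y) = c(X -> Y) - c(empty -> Y)."""
    return confidence(dataset, x, y) - confidence(dataset, "", y)


for x, y in [("A", "BC"), ("D", "B"), ("B", "C")]:
    r = relative_confidence(EXAMPLE_1, x, y)
    lift = confidence(EXAMPLE_1, x, y) / confidence(EXAMPLE_1, "", y)
    leverage = (support(EXAMPLE_1, frozenset(x) | frozenset(y))
                - support(EXAMPLE_1, x) * support(EXAMPLE_1, y))
    # Leverage is relative confidence weighted by the antecedent's support.
    assert leverage == support(EXAMPLE_1, x) * r
    print(f"{x} -> {y}: r = {r}, lift = {lift}, leverage = {leverage}")
```

For B -> C the relative confidence is negative even though the confidence, 9/11, passes a threshold of 0.8: C is marginally more frequent in the whole dataset than among the transactions containing B.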

An objection of a different sort is that lift and leverage are symmetric.
As the implicational syntax is asymmetric, they do not fit very well the
directional intuition of an expression like X -> Y; that is one of the reasons
behind the exploration of many other options. However, to date, none of the
more sophisticated attempts seems to have gained a really noticeable “market
share”. Most common implementations either offer a long list of options of
measures for the user to choose from (like [13] for one), or employ the simpler
notions of confidence, support, lift, or leverage (for instance, Magnum Opus
[38]). We believe that one must keep close to confidence and to deviation
from independence. Confidence is the most natural option for many educated
domain experts not specialized in data mining, and it actually provides a
directionality to our partial implications.

The vast majority of these alternatives attempt to define the quality
of partial implication X -> Y relying only on the supports of X, Y, XY,
or their complements. One major exception is improvement [12], which is
the added confidence obtained by using the given antecedent as opposed to
any properly smaller one. We discuss it and two other related quantities
next. They are motivated again by our central intuition: if the confidence of
a partial implication with a smaller antecedent and the same consequent is
sufficiently high, the larger partial implication should not be provided in the
output. They have in common that their computation requires exploration
of a larger space, however; we return to this point in the next section.

5.1 Improvement: Additive and Multiplicative 

The key observation for this section is that X -> Y and Z -> Y, for Z ⊂ X,
provide different, independent information. From the perspective of
confidence, either may have it arbitrarily higher than the other. For inequality
in one direction, suppose that almost all transactions with X have Y, but
they are just a small fraction of those supporting Z, which mostly lack Y;
conversely, Y might hold for most transactions having Z, but the only
transactions having all of X can be those without Y. In Example 1, one can see
that c(∅ -> BC) < c(A -> BC) whereas c(∅ -> C) > c(B -> C).

This fact underlies the difficulty in choosing a proper confidence bound.
Assume that there exists a mild correlation giving, say, c(Z -> A) = 2/3. If
the threshold is set higher, of course this rule is not found; but an undesirable
side effect may appear: there may be many ways of choosing subsets of the
support of Z, by enlarging Z a bit, where A is frequent enough to pass the
threshold. Thus, often, in practice, the algorithms enlarge Z into various
supersets X_i so that all the confidences c(X_i -> A) do pass, and then Z -> A
is not seen, but generates dozens of very similar “noisy” rules, to be manually
explored and filtered. Finding the appropriate threshold becomes difficult,
also because, for different partial implications, this sort of phenomenon may
appear at several threshold values simultaneously.

Relative confidence tests confidence by a comparison to what happens
if the antecedent is replaced by one of its subsets in particular, namely ∅.
Improvement generalizes it by considering not only the alternative partial
implication ∅ -> Y but all proper subsets of the antecedent, as alternative
antecedents, and in the same additive form:

Definition 5 The improvement of X -> Y, where X ≠ ∅, is i(X -> Y) =
min{c(X -> Y) - c(Z -> Y) | Z ⊂ X}.

The definition is due to [12], where only association rules are considered,
that is, cases where |Y| = 1. The work on productive rules |3D] is related:
these coincide with the rules of positive improvement. In [25], improvement
is combined with further pruning on the basis of the χ² value. We literally
quote from [12]: “A rule with negative improvement is typically undesirable
because the rule can be simplified to yield a proper sub-rule that is more
predictive, and applies to an equal or larger population due to the antecedent
containment relationship. An improvement greater than 0 is thus a desirable
constraint in almost any application of association rule mining. A larger
minimum on improvement is also often justified because most rules in dense
data-sets are not useful due to conditions or combinations of conditions that
add only a marginal increase in confidence.”

The same process, and with the same intuitive justification, can be applied
to lift, which is, actually, a multiplicative, instead of additive, version of
relative confidence as indicated above: ℓ(X -> Y) = c(X -> Y)/c(∅ -> Y).
Taking inspiration from this correspondence, we studied in p3j a multiplicative
variant of improvement that generalizes lift, exactly in the same way as
improvement generalizes relative confidence:

Definition 6 The multiplicative improvement of X -> Y, where X ≠ ∅, is
m(X -> Y) = min{c(X -> Y)/c(Z -> Y) | Z ⊂ X}.

In Example 1, the facts that c(A -> BC) = 4/5 and c(∅ -> BC) = 3/4
lead to i(A -> BC) = 4/5 - 3/4 = 0.05 and m(A -> BC) = (4/5)/(3/4) ≈
1.067. Here, as the size of the antecedent is 1, there is one single
candidate Z = ∅ for proper subset of the antecedent and, therefore, improvement
coincides with relative confidence, and multiplicative improvement coincides
with lift. For larger left-hand sides, the values will be different in general.
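
Definitions 5 and 6 can be evaluated by enumerating the proper subsets of the antecedent. The sketch below does so on the dataset of Example 1; the function names are illustrative assumptions of ours, and the enumeration is exponential in the size of the antecedent, which is the cost alluded to at the end of the previous section.

```python
from fractions import Fraction
from itertools import combinations

# Dataset of Example 1.
EXAMPLE_1 = (
    [frozenset("ABCDE")] * 6
    + [frozenset("ABC")] * 2
    + [frozenset("AB")] * 2
    + [frozenset("CDE"), frozenset("BC")]
)


def count(dataset, itemset):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in dataset if items <= t)


def confidence(dataset, x, y):
    """c(X -> Y) as an exact fraction of unnormalized supports."""
    x = frozenset(x)
    return Fraction(count(dataset, x | frozenset(y)), count(dataset, x))


def proper_subsets(itemset):
    """All proper subsets of a non-empty itemset, the empty set included."""
    items = sorted(itemset)
    return [frozenset(c) for r in range(len(items))
            for c in combinations(items, r)]


def improvement(dataset, x, y):
    """Definition 5: smallest additive gain over any proper subset of X."""
    c_xy = confidence(dataset, x, y)
    return min(c_xy - confidence(dataset, z, y) for z in proper_subsets(x))


def multiplicative_improvement(dataset, x, y):
    """Definition 6: smallest ratio; assumes every c(Z -> Y) is nonzero."""
    c_xy = confidence(dataset, x, y)
    return min(c_xy / confidence(dataset, z, y) for z in proper_subsets(x))


print(improvement(EXAMPLE_1, "A", "BC"))                 # 1/20
print(multiplicative_improvement(EXAMPLE_1, "A", "BC"))  # 16/15
print(improvement(EXAMPLE_1, "AB", "C"))                 # -1/30
print(multiplicative_improvement(EXAMPLE_1, "AB", "C"))  # 24/25
```

For AB -> C the minimum is attained at Z = ∅: the confidence 4/5 is lower than c(∅ -> C) = 5/6, so the improvement is negative and the multiplicative improvement is below 1.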

5.2 Rule Blocking 

In an attempt to formalize the same part of the central intuition, we proposed
in [6] a notion of “rule blocking”, where a smaller antecedent Z ⊂ X would
“block” (that is, suggest omitting) a given partial implication X → Y. We
compare the number of tuples having XY (that is, having Y within
the supporting set of X) with the quantity that would be predicted from
the confidence of the partial implication Z → Y, which applies to a larger
supporting set: we bound the relative error incurred if the
support s(X) and the confidence of Z → Y are employed to approximate the
confidence of X → Y.

More precisely, let c(Z → Y) = c. If Y is distributed along the support
of X at the same ratio as along the larger support of Z, we would expect
s(XY) ≈ c × s(X): we consider the relative error committed by c × s(X)
used as an approximation to s(XY) and, if the error is low, we consider that
Z → Y is sufficient information about X → Y and dispose of the latter.

Definition 7 Z ⊂ X blocks X → Y at blocking threshold ε when

\[ \frac{s(XY) - c(Z \to Y)\, s(X)}{c(Z \to Y)\, s(X)} \le \epsilon. \]

In case the difference in the numerator is negative, it would mean that
s(XY) is even lower than what Z → Y would suggest. If it is positive but
the quotient is low, c(Z → Y) × s(X) still yields a good approximation to
s(XY), and hence to c(X → Y), and the larger partial implication X → Y does
not bring high enough confidence to be considered besides Z → Y, a simpler
one: it remains blocked. But, if the quotient is larger, and this happens for
all Z, then X → Y becomes interesting since its confidence is sufficiently
higher than suggested by other partial implications of the form Z → Y for
smaller antecedents Z. Of course, the higher the blocking threshold, the more
demanding the constraint is. Note that, in the presence of a support
threshold τ, s(ZY) ≥ s(XY) > τ or a similar inequality would be additionally
required. The value ε is intended to take positive but small values, say
around 0.2 or lower. In Example 1, ∅ blocks A → BC at blocking threshold
1/15 ≈ 0.066.
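
The blocking test of Definition 7 can be sketched in a few lines of Python.
This is an illustrative sketch under stated assumptions: the dataset is a
hypothetical toy example chosen so that c(A → BC) = 4/5 and c(∅ → BC) = 3/4
(it is not claimed to be the dataset of Example 1), the function names are
ours, and no support threshold is enforced.

```python
from fractions import Fraction

# Hypothetical toy dataset (an assumption, not the survey's Example 1).
DATA = [frozenset(t) for t in
        ["ABCD", "ABCD", "ABCD", "ABC", "A", "BC", "BC", "D"]]


def support(itemset):
    """s(Z): fraction of transactions containing every item of Z."""
    items = frozenset(itemset)
    return Fraction(sum(items <= t for t in DATA), len(DATA))


def confidence(x, y):
    """c(X -> Y) = s(XY) / s(X); assumes s(X) > 0."""
    return support(frozenset(x) | frozenset(y)) / support(x)


def blocking_error(z, x, y):
    """Relative error of c(Z -> Y) * s(X) as an estimate of s(XY).

    Assumes Z is a proper subset of X and c(Z -> Y) > 0.
    """
    predicted = confidence(z, y) * support(x)
    return (support(frozenset(x) | frozenset(y)) - predicted) / predicted


def blocks(z, x, y, eps):
    """True when Z blocks X -> Y at blocking threshold eps."""
    return blocking_error(z, x, y) <= eps


print(blocking_error("", "A", "BC"))            # 1/15
print(blocks("", "A", "BC", Fraction(1, 15)))   # True
print(blocks("", "A", "BC", Fraction(1, 20)))   # False
```

The empty string stands for the empty antecedent ∅; on this toy dataset the
relative error for A → BC with respect to ∅ is exactly 1/15, in agreement
with the value quoted above.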

Rule blocking relates to multiplicative improvement as follows: 

Proposition 2 The smallest blocking threshold at which X → Y is blocked
is m(X → Y) − 1.

Proof As everything around is finite, this is equivalent to proving that Z ⊂
X blocks X → Y at blocking threshold ε if and only if
c(X → Y)/c(Z → Y) − 1 ≤ ε, for all such Z. Starting from the definition of
blocking, multiplying both sides of the inequality by c(Z → Y), separating
the two terms of the left-hand side, replacing s(XY)/s(X) by its meaning,
c(X → Y), and then solving first for c(Z → Y) and finally for ε, we find the
stated equivalence. All the algebraic manipulations are reversible. ■
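
One way to spell out the algebra is the following chain, which takes a
slightly shorter route than the steps listed in the proof and assumes
s(X) > 0 and c(Z → Y) > 0 so that all quotients are defined:

```latex
\begin{align*}
\frac{s(XY) - c(Z \to Y)\, s(X)}{c(Z \to Y)\, s(X)} \le \epsilon
  &\iff \frac{s(XY)}{c(Z \to Y)\, s(X)} - 1 \le \epsilon \\
  &\iff \frac{s(XY)/s(X)}{c(Z \to Y)} - 1 \le \epsilon \\
  &\iff \frac{c(X \to Y)}{c(Z \to Y)} - 1 \le \epsilon .
\end{align*}
```

Taking the minimum of the left-hand side over all Z ⊂ X then gives
m(X → Y) − 1 as the smallest threshold at which some Z blocks X → Y.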

5.3 Ongoing: Conditional Weighted Versions of Lift and Leverage

We propose here one additional step to enhance the flexibility of both lift 
and leverage by considering their action, on the same partial implication, but 
with respect to many different subsets of the dataset, and under a weighting
scheme that leads to different existing measures according to the weights 
chosen. 

For a given partial implication X → Y, we consider many limited views
of the dataset, namely, all its projections into subsets of the antecedent. 
We propose to measure a weighted variant of the lift and/or the leverage 
of the same partial implication in all these projections, and evaluate as the 
quality of the partial implication the minimum value thus obtained. That 
is, we want our high-quality partial implications not only to have high lift or 
leverage, but also to maintain it when we consider projections of the dataset 
on the subsets of the antecedent. We call the measures obtained conditional 
weighted lift and leverage. 

Definition 8 Assume X ∩ Y = ∅. Let w be a weighting function associating
a weight (either a positive real number or ∞) to each proper subset of X.
The conditional weighted lift of partial implication X → Y is
$\ell_{\mathcal{D},w}(X \to Y) = \min\{w(Z)\,\ell_{\mathcal{D}_Z}(X \to Y) \mid Z \subset X\}$.
The conditional weighted leverage of partial implication X → Y is
$\lambda_{\mathcal{D},w}(X \to Y) = \min\{w(Z)\,\lambda_{\mathcal{D}_Z}(X \to Y) \mid Z \subset X\}$.

These notions can be connected to other existing notions with unificatory 
effects. We only state here one such connection. Further development will 
be provided in a future paper in preparation. 

Proposition 3 For inverse confidence weights, conditional weighted leverage
is improvement: for all X → Y, $\lambda_{\mathcal{D},w}(X \to Y) = i(X \to Y)$ holds for the
weighting function $w(Z) = c_{\mathcal{D}}(Z \to X)^{-1}$.
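
The statement of Proposition 3 can be checked numerically on small data.
The Python sketch below rests on two assumptions that should be read as
ours: the projection of the dataset on Z is taken to be the sub-dataset of
the transactions that contain Z, and leverage is taken in its usual form
s(XY) − s(X)s(Y), with supports measured inside the projection. The dataset
is a hypothetical toy example and all function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy dataset (an assumption made for illustration only).
DATA = [frozenset(t) for t in
        ["ABCD", "ABCD", "ABCD", "ABC", "A", "BC", "BC", "D"]]


def support(itemset, data):
    """s(Z) measured within `data`."""
    items = frozenset(itemset)
    return Fraction(sum(items <= t for t in data), len(data))


def confidence(x, y, data):
    """c(X -> Y) = s(XY) / s(X) within `data`; assumes s(X) > 0."""
    return support(frozenset(x) | frozenset(y), data) / support(x, data)


def proper_subsets(x):
    """All proper subsets of X, the empty set included."""
    items = sorted(x)
    for k in range(len(items)):
        for z in combinations(items, k):
            yield frozenset(z)


def project(z, data):
    """Assumed meaning of 'projection on Z': transactions containing Z."""
    z = frozenset(z)
    return [t for t in data if z <= t]


def leverage(x, y, data):
    """Usual leverage s(XY) - s(X) s(Y), measured within `data`."""
    xy = frozenset(x) | frozenset(y)
    return support(xy, data) - support(x, data) * support(y, data)


def conditional_weighted_leverage(x, y, weight, data):
    """Minimum over Z of w(Z) times the leverage in the projection on Z."""
    return min(weight(z) * leverage(x, y, project(z, data))
               for z in proper_subsets(x))


def inverse_confidence_weight(x, data):
    """w(Z) = 1 / c(Z -> X), computed on the full dataset."""
    return lambda z: 1 / confidence(z, x, data)


def improvement(x, y, data):
    """i(X -> Y) = min over proper subsets Z of c(X -> Y) - c(Z -> Y)."""
    c_xy = confidence(x, y, data)
    return min(c_xy - confidence(z, y, data) for z in proper_subsets(x))


w = inverse_confidence_weight("AD", DATA)
print(conditional_weighted_leverage("AD", "BC", w, DATA))  # 1/5
print(improvement("AD", "BC", DATA))                       # 1/5
```

Under these assumptions each term w(Z) times the projected leverage
simplifies to c(X → Y) − c(Z → Y), which is why the two minima agree.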

6 Support Ratio and Confidence Boost 

From the perspective of our central intuition, the previous section has
developed, essentially, issues related to smallish antecedents. This is fully
appropriate for the discussion of association rules, which were defined
originally as partial implications with singleton consequents. We now briefly
concentrate on largish consequents, and then join both perspectives.

6.1 Support Ratio

The support ratio was first employed, to our knowledge, in [23], where no
particular name was assigned to it. Together with other similar quotients, it
was introduced in order to help obtain faster algorithmics.


Definition 9 In the presence of a support threshold τ, the support ratio of
a partial implication X → Y is

\[ \sigma(X \to Y) = \frac{s(XY)}{\max\{s(Z) \mid XY \subset Z,\ s(Z) > \tau\}}. \]

We see that this quantity depends on XY but not on the antecedent X
itself. In Example 1, we find that σ(A → BC) = 4/3.
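
A direct, unoptimized reading of Definition 9 is sketched below in Python.
The dataset is a hypothetical toy example, built so that σ(A → BC) = 4/3 as
in the value just quoted; it is not claimed to be the dataset of Example 1.
Returning None when no proper superset of XY clears the threshold is a
choice of this sketch, since that case is not discussed above.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy dataset (an assumption, not the survey's Example 1).
DATA = [frozenset(t) for t in
        ["ABCD", "ABCD", "ABCD", "ABC", "A", "BC", "BC", "D"]]
ITEMS = frozenset().union(*DATA)


def support(itemset):
    """s(Z): fraction of transactions containing every item of Z."""
    items = frozenset(itemset)
    return Fraction(sum(items <= t for t in DATA), len(DATA))


def support_ratio(x, y, tau=Fraction(0)):
    """s(XY) divided by the largest support, above tau, of a proper superset."""
    xy = frozenset(x) | frozenset(y)
    others = sorted(ITEMS - xy)
    supports = [
        support(xy | frozenset(extra))
        for k in range(1, len(others) + 1)
        for extra in combinations(others, k)
    ]
    frequent = [s for s in supports if s > tau]
    if not frequent:
        return None  # sketch-specific convention: no admissible superset
    return support(xy) / max(frequent)


print(support_ratio("A", "BC"))   # 4/3
print(support_ratio("AB", "C"))   # 4/3, same XY, different antecedent
print(support_ratio("A", "BC", tau=Fraction(1, 2)))  # None
```

Since support is antimonotone, the maximum is always reached by adding a
single item to XY; the full enumeration is kept only to mirror the
definition literally.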


6.2 Confidence Boost 

Definition 10 The confidence boost of a partial implication X → Y (always
with X ∩ Y = ∅) is

\[ \beta(X \to Y) = \frac{c(X \to XY)}{\max\{c(X' \to X'Y') \mid (X \to XY) \not\equiv (X' \to X'Y'),\ X' \subseteq X,\ Y \subseteq Y'\}}, \]

where the partial implications in the denominator are implicitly required to
clear the support threshold, in case one is enforced: s(X' → X'Y') > τ.

Let us explain the interpretation of this parameter. Suppose that β(X → Y)
is low, say β(X → Y) < b, where b is just slightly larger than 1. Then,
according to the definition, there must exist some different partial
implication X' → X'Y', with X' ⊆ X and Y ⊆ X'Y', such that
c(X → Y)/c(X' → Y') < b or, equivalently, c(X' → Y') > c(X → Y)/b. This
inequality says that the partial implication X' → Y', stating that
transactions with X' tend to have X'Y', has a relatively high confidence,
not much lower than that of X → Y; equivalently, the confidence of X → Y is
not much higher (it could be lower) than that of X' → Y'. But all
transactions having X do have X', and all transactions having Y' have Y, so
that the confidence found for X → Y is not really that novel, given that it
does not give so much additional confidence over a partial implication that
states such a similarly confident, and intuitively stronger, fact, namely
X' → Y'.

This author has developed a quite successful open-source partial implication
miner based on confidence boost (yacaree.sf.net); all readers are
welcome to experiment with it and provide feedback. We note also that the
confidence width alluded to in Section 13.11, while having different
theoretical and practical properties, is surprisingly close in definition to
confidence boost. See [8] for further discussion of all these issues.
Confidence boost fits the general picture as follows:

Proposition 4 β(X → Y) = min{σ(X → Y), m(X → Y)}.

The inequalities β(X → Y) ≤ σ(X → Y) (due to [11]) and β(X → Y) ≤
m(X → Y) are simple to argue: the consequent leading to the support ratio,
or the antecedent leading to the multiplicative improvement, take a role in the
denominator of confidence boost. Conversely, taking the maximizing partial
implication in the denominator, if it has the same antecedent X then one
obtains a bound on the support ratio whereas, if the antecedent is properly
smaller, a bound on the multiplicative improvement follows.

In Example 1, since σ(A → BC) = 4/3 and m(A → BC) = (4/5)/(3/4),
which is smaller, we obtain β(A → BC) = (4/5)/(3/4) ≈ 1.066.
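
Proposition 4 can be checked by brute force on small data. The Python
sketch below enumerates every pair (X', X'Y') allowed in the denominator of
Definition 10 and compares the result with min{σ, m}. It is an illustration
under assumptions: the dataset is a hypothetical toy example (not claimed to
be that of Example 1), the support threshold defaults to 0, and an undefined
support ratio (no proper superset with positive support) is simply left out
of the minimum.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy dataset (an assumption, not the survey's Example 1).
DATA = [frozenset(t) for t in
        ["ABCD", "ABCD", "ABCD", "ABC", "A", "BC", "BC", "D"]]
ITEMS = frozenset().union(*DATA)


def support(itemset):
    items = frozenset(itemset)
    return Fraction(sum(items <= t for t in DATA), len(DATA))


def subsets(items):
    """All subsets of `items`, from the empty set up to `items` itself."""
    pool = sorted(items)
    for k in range(len(pool) + 1):
        for combo in combinations(pool, k):
            yield frozenset(combo)


def confidence_boost(x, y, tau=Fraction(0)):
    """Brute-force confidence boost; None if no competing rule exists."""
    x, y = frozenset(x), frozenset(y)
    xy = x | y
    best = None
    for x2 in subsets(x):                    # X' is any subset of X
        base = x2 | y                        # smallest admissible X'Y'
        for extra in subsets(ITEMS - base):  # enlarge the consequent
            w = base | extra                 # w plays the role of X'Y'
            if (x2, w) == (x, xy) or support(w) <= tau:
                continue
            c = support(w) / support(x2)
            best = c if best is None else max(best, c)
    return None if best is None else (support(xy) / support(x)) / best


def mult_improvement(x, y):
    """m(X -> Y); assumes X nonempty and every c(Z -> Y) > 0."""
    x, y = frozenset(x), frozenset(y)
    c_xy = support(x | y) / support(x)
    return min(c_xy / (support(z | y) / support(z))
               for z in subsets(x) if z != x)


def support_ratio(x, y, tau=Fraction(0)):
    """sigma(X -> Y); None when no proper superset of XY clears tau."""
    xy = frozenset(x) | frozenset(y)
    sups = [support(xy | e) for e in subsets(ITEMS - xy) if e]
    sups = [s for s in sups if s > tau]
    return support(xy) / max(sups) if sups else None


def min_of_defined(*values):
    return min(v for v in values if v is not None)


for x, y in [("A", "BC"), ("AB", "C"), ("AD", "BC")]:
    beta = confidence_boost(x, y)
    assert beta == min_of_defined(support_ratio(x, y), mult_improvement(x, y))
    print(x, "->", y, "boost", beta)
# A -> BC boost 16/15
# AB -> C boost 1
# AD -> BC boost 5/4
```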

A related proposal in [22] suggests directly minimizing the antecedents
and maximizing the consequents, within the confidence bound, in a context
where antecedents and consequents are kept disjoint. This is similar to
statement (3) in Lemma 1 except that, there, one maximizes jointly
consequent and antecedent. If consequents are maximized separately, then the
central intuition fails, but there is an interesting connection with confidence
boost; see pj].

The measures in this family of improvement, including conditional weighted
variants and also confidence boost, tend to require exploration of larger spaces
of antecedents compared to simpler rule quality measures. This objection
turns out not to be too relevant because human-readable partial implications
often have just a few items in the antecedent. Nontrivial algorithmic
proposals for handling this issue appear as well in [8].

6.3 Ongoing Developments 

We briefly mention here the following observations. First, as in Section 1X21,
a variant of confidence boost appropriate for closure-based analysis exists [8].
Second, both variants trivialize if they are applied directly, in their literal
terms, to full implications. However, the intuitions leading to confidence
boost can be applied as well to full implications. In future work, currently
in preparation, we will discuss proposals for formalizing the same intuition
in the context of full implications.

7 Evaluation of Evaluation Measures 

We have covered just a small fraction of the evaluation measures proposed to
endow the partial implication connective with useful semantics. All of those
actually attempt to capture a potential (but maybe nonexistent) “naive
concept” of interesting partial implication from the perspective of an end
user. Eventually, we would like to find one such semantics that fits that
hypothetical naive concept as well as possible.

We can see no choice but to embark, at some point, on the creation of
resources where, for specific datasets, the interest of particular implications
is recorded as per the assessment of individual humans. Some approximations
to this plan are Section 5.2 of [8], where the author, as a scientific
expert, subjectively evaluates partial implications obtained from abstracts
of scientific papers; a similar approach in [14] using PKDD abstracts; and
the work in [TOj M] where partial implications found on educational datasets
from university course logs are evaluated by the teachers of the
corresponding courses. These preliminary experiments are positive and we hope
that a more ambitious attempt could be made in the future along these lines.

The idea of evaluating associators through the predictive capabilities of
the rules found has been put forward in several sources, e.g. [29]. The usage
of association rules for direct prediction (where the “class” attribute is
forced to occur in the consequent) has been widely studied (e.g. [H]). In
[29], two different associators are employed to find rules with the “class”
as consequent, and they are compared in terms of predictive accuracy. This
scheme is inappropriate to evaluate our proposals for the semantics of partial
implications because, first, we must focus on single pairs of attribute and
value as right-hand side, thus making it useless to consider larger right-hand
sides; and, also, the classification will only be sensitive to minimal left-hand
sides independently of their confidences.

In [9], we have deployed an alternative framework that allows us to evaluate
the diverse options of semantics for association rules, in terms of their
usefulness for subsequent predictive tasks. By means of a mechanism akin
to the AUC measure for predictor evaluation, we have focused on potential
accuracy improvements of predictors on given, public, standard benchmark
datasets, if one more Boolean column is added, namely, one that is true
exactly for those observations that are exceptions to one association rule: the
antecedent holds but the consequent does not. In a sense, we use the
association rule as a “hint of outliers”, but, instead of removing them, we
simply offer direct access to this label to the predictor, through the extra
column. Of course, in general this may lead the predictor astray instead of
helping it. Our experiments suggest that leverage, support, and multiplicative
improvement tend to be better than the other measures with respect to this
evaluation score.
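
The extra Boolean column described above is straightforward to build. The
Python sketch below only illustrates that one step, on a hypothetical toy
dataset with illustrative function names; it does not reproduce the
evaluation framework of [9], and the training of a predictor on the
augmented table is left out.

```python
# Hypothetical toy dataset (an assumption made for illustration only).
DATA = [frozenset(t) for t in
        ["ABCD", "ABCD", "ABCD", "ABC", "A", "BC", "BC", "D"]]


def exception_column(transactions, antecedent, consequent):
    """True exactly where the antecedent holds but the consequent does not."""
    x, y = frozenset(antecedent), frozenset(consequent)
    return [x <= t and not (y <= t) for t in transactions]


def augmented_rows(transactions, antecedent, consequent, label="exception"):
    """One Boolean column per item, plus the exception flag of the rule."""
    items = sorted(frozenset().union(*transactions))
    flags = exception_column(transactions, antecedent, consequent)
    return [{**{item: item in t for item in items}, label: flag}
            for t, flag in zip(transactions, flags)]


print(exception_column(DATA, "A", "BC"))
# [False, False, False, False, True, False, False, False]
print(augmented_rows(DATA, "A", "BC")[4])
# {'A': True, 'B': False, 'C': False, 'D': False, 'exception': True}
```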

7.1 Ongoing Developments 

We are currently developing further frameworks that, hopefully, might be
helpful in assessing the relative merits of the different candidates for
semantics of partial implications, often put forward as rule quality measures.
One of them resorts to an empirical application of approximations to the MDL
principle along the lines of Krimp [37]. A second idea is to make explicit
the dependence on alternative partial implications, in the sense that X → Y
would mean, intuitively, that Y appears often on the support of X and that,
barring the presence of some other partial implication to the contrary, it is
approximately uniformly distributed there. These avenues will hopefully be
explored over the coming months or years. A common thread is that additional
statistical knowledge, along the lines of the self-sufficient itemsets of
Webb ||0], for instance, is expected to be at play in the future developments
of the issue of endowing the partial implication connective with the right
intuitive semantics.